Multilingual Resources for Entity Extraction

Authors

  • Stephanie Strassel
  • Alexis Mitchell
Abstract

Progress in human language technology requires increasing amounts of data and annotation in a growing variety of languages. Research in named entity extraction is no exception. The Linguistic Data Consortium is creating annotated corpora to support information extraction in English, Chinese, Arabic, and other languages for a variety of US Government-sponsored programs. This paper covers the scope of annotation and research tasks within these programs, describes some of the challenges of multilingual corpus development for entity extraction, and concludes with a description of the corpora developed to support this research.

Similar resources

Shared Resources for Multilingual Information Extraction and Challenges in Named Entity Annotation

Progress in natural language processing requires increasing amounts of data and annotation in a growing variety of languages, and research in named entity extraction is no exception. While the value of richly annotated, large-scale multilingual corpora is undeniable, costs for producing such data are high, underscoring the value of shared resources. As part of the US Government-sponsored Automati...

Multilingual and Cross-lingual Timeline Extraction

In this paper we present an approach to extract ordered timelines of events, their participants, locations and times from a set of multilingual and cross-lingual data sources. Based on the assumption that event-related information can be recovered from different documents written in different languages, we extend the Cross-document Event Ordering task presented at SemEval 2015 by specifying two ...

Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili

The Europe Media Monitor (EMM) family of applications is a set of multilingual tools that gather, cluster and classify news in currently fifty languages and that extract named entities and quotations (reported speech) from twenty languages. In this paper, we describe the recent effort of adding the African Bantu language Swahili to EMM. EMM is designed in an entirely modular way, allowing plugg...

TIDES Language Resources: A Resource Map for Translingual Information Access

Continuing improvements in human language algorithms, coupled with improvements in digital storage and processing, inspire growing confidence in multilingual information access systems. Systems exist to transcribe broadcast news, segment broadcasts into individual stories and sort them by topic. These technologies, useful in isolation, are now being combined to produce intelligent multilingual ...

Creation of Reusable Components and Language Resources for Named Entity Recognition in Russian

This paper describes the development of the RussIE system in which we experimented with the creation of reusable processing components and language resources for a Russian Information Extraction system. The work was done as part of a multilingual project to adapt existing tools and resources for HLT to new domains and languages. The system was developed within the GATE architecture for language...

Publication date: 2003